A convolutional neural network based tool for predicting protein AMPylation sites from binary profile representation

AMPylation is an emerging post-translational modification that occurs on the hydroxyl group of threonine, serine, or tyrosine via a phosphodiester bond. AMPylators catalyze this process, the covalent attachment of adenosine monophosphate to the amino acid side chain of a peptide. Recent studies have shown that this post-translational modification is directly responsible for the regulation of neurodevelopment and neurodegeneration and is also involved in many physiological processes. Despite the importance of this post-translational modification, there is no peptide sequence dataset available for conducting computational analysis. Therefore, so far, no computational approach has been proposed for predicting AMPylation. In this study, we introduce a new dataset of this distinct post-translational modification and develop a new machine learning tool, called DeepAmp, that uses a deep convolutional neural network to predict AMPylation sites in proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient, and Area Under Curve, respectively, for the AMPylation site prediction task. As the first machine learning model for this task, DeepAmp demonstrates promising results which highlight its potential to solve this problem. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.

www.nature.com/scientificreports/
the hydroxyl group of threonine, serine, or tyrosine via a phosphodiester bond. In the AMPylation process, Adenosine Monophosphate (AMP) gets covalently attached to the amino acid side chain of a protein molecule. AMPylation involves a phosphodiester bond between a hydroxyl group of the molecule undergoing AMPylation and the phosphate group of the adenosine monophosphate nucleotide (i.e. adenylic acid) 14. The enzymes that are capable of catalyzing this process are called AMPylators. Threonine (T) and Tyrosine (Y) are the usual targets of AMPylation, while this PTM can sometimes be observed on Serine (S) as well.
Recent proteomics studies demonstrated that this PTM is more common than generally acknowledged and that it is emerging as a significant regulatory mechanism in both eukaryotic and prokaryotic cells. It is implicated in a wide range of biological processes, from the regulation of nitrogen metabolism in bacteria and the regulation of signaling pathways to pathogenesis in several animal species [11][12][13][14]. AMPylation has also been found to play a significant role in the regulation of neurodevelopment and neurodegeneration 15.
To the best of our knowledge, no computational approach has so far been proposed for predicting AMPylation sites of Fic domain protein. One of the main reasons is that no AMPylation dataset is available for this task. In this study, we present a new dataset of protein AMPylation sites. Furthermore, we propose a new deep convolutional neural network (CNN) model called DeepAmp for predicting protein AMPylation sites on this new dataset of AMP-modified proteins. DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of Accuracy, Sensitivity, Specificity, Matthews Correlation Coefficient (MCC), and Area Under Curve (AUC), respectively, for the AMPylation site prediction task. As the first machine learning model for this task, DeepAmp demonstrates promising results which highlight its potential to solve this problem. We believe this study will help researchers narrow the current research gap in this subject. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.

Results and discussion
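Evaluation metrics. To standardize the evaluation of our model and to provide more insight into our results, we calculate Accuracy, Sensitivity, Specificity, and the Matthews correlation coefficient (MCC) as evaluation metrics. These metrics are defined by the following equations:

$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

$$\mathrm{Sensitivity} = \frac{tp}{tp + fn}$$

$$\mathrm{Specificity} = \frac{tn}{tn + fp}$$

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$

where tp denotes true positive, and tn, fp, fn denote true negative, false positive, and false negative, respectively.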
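As a minimal sketch of how these four metrics follow from the confusion-matrix counts, the snippet below computes them with the standard definitions. The function name `classification_metrics` and the example counts are illustrative assumptions, not values or code from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Sensitivity, Specificity and MCC from raw counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of true sites recovered
    specificity = tn / (tn + fp)  # fraction of non-sites rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report MCC as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=30, tn=40, fp=10, fn=8)
perfect = classification_metrics(tp=5, tn=5, fp=0, fn=0)
```

Unlike Accuracy, MCC uses all four counts in both numerator and denominator, which is why it is often preferred when the two classes are not equally sized.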
Additionally, to show the model's ability to distinguish between AMPylated and non-AMPylated sites, we calculated the Area Under the Curve (AUC). AUC measures the ability of a classifier to distinguish between classes: the higher the AUC, the better the model separates positive from negative instances. An AUC of 1 indicates that the classifier ranks every positive instance above every negative one, an AUC of 0.5 corresponds to random guessing, and an AUC of 0 indicates that the classifier ranks every negative instance above every positive one 30.

Comparison with different machine learning techniques. Since DeepAmp is the first computational model proposed to predict AMPylation, its performance cannot be compared with that of previous studies. However, to investigate the effectiveness of a CNN as the basis of DeepAmp, we compare it with other ML models on the same problem. Results achieved using DeepAmp and other ML models, including Support Vector Machine (SVM), Random Forest (RF), Linear Regression (LR), Decision Tree (DT), and K-Nearest Neighbor (KNN), all using the same set of features, are presented in Tables 1 and 2 for fivefold and tenfold cross-validation, respectively. For every metric, Tables 1 and 2 report the average over 10 runs of fivefold and tenfold cross-validation. As shown in these tables, DeepAmp achieves considerably better results across the evaluation metrics than the other machine learning methods investigated in this study.
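To make the AUC used in these comparisons concrete, here is a small sketch that computes it directly from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy scores are assumptions for illustration; this is not the evaluation code of the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Toy scores (hypothetical): perfect, inverted, and partially correct rankings.
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
auc_inverted = roc_auc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
auc_partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a dataset of a few hundred peptides; library implementations typically use a sort-based equivalent.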
As shown in Table 1, DeepAmp achieves 75.9%, 77.2%, 75.2%, 0.52, and 0.84 in terms of Accuracy, Sensitivity, Specificity, MCC, and AUC, respectively, for the AMPylation site prediction task using fivefold cross-validation. According to Table 2, DeepAmp achieves 77.7%, 79.1%, 76.8%, 0.55, and 0.85 in terms of the same metrics using tenfold cross-validation. As shown in Tables 1 and 2, the results using tenfold cross-validation are slightly better than those using fivefold cross-validation. This can be attributed to the larger number of samples used to train the model in tenfold cross-validation. In k-fold cross-validation, we evaluate the model on 1/k of the data and train on the rest. Therefore, for our dataset of 403 peptides, each tenfold iteration trains on about 362 samples, whereas each fivefold iteration trains on about 322 samples. This also suggests that, with a larger dataset, DeepAmp could achieve even better results.
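The training-set sizes follow from simple arithmetic on the 403-peptide dataset. The sketch below assumes folds that are as equal in size as possible; the helper names `fold_sizes` and `training_sizes` are hypothetical.

```python
def fold_sizes(n_samples, k):
    """Sizes of k validation folds that differ by at most one sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


def training_sizes(n_samples, k):
    """Number of training samples left when each fold is held out in turn."""
    return [n_samples - size for size in fold_sizes(n_samples, k)]


N_PEPTIDES = 403  # 153 AMPylated + 250 non-AMPylated peptides

train_5 = training_sizes(N_PEPTIDES, 5)    # 322 or 323 per iteration
train_10 = training_sizes(N_PEPTIDES, 10)  # 362 or 363 per iteration
```

Moving from k = 5 to k = 10 therefore adds roughly 40 training peptides per iteration, about a 12% increase.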
In Fig. 1, the receiver operating characteristic (ROC) curves illustrate the ability of the DeepAmp model to distinguish AMPylation from non-AMPylation sites. To provide further information for readers, the ROC curves for each fold of the fivefold and tenfold cross-validations are provided as supplementary materials (Figs. S1, S2). Also, as shown in Tables 1 and 2, the other ML models display mediocre classification quality in terms of MCC, whereas DeepAmp shows a significant improvement. This demonstrates the effectiveness of DeepAmp over the other classifiers in consistently identifying positive and negative samples.

Methods and materials
This section describes the proposed method and the benchmark dataset presented in this study.

Benchmark dataset. Kielkowski et al. 9 identified AMPylation in intact cancer cells via LC-MS/MS as well as imaging methods. Using a pronucleotide probe, they identified protein AMPylation in living cells. They synthesized an N6-propargyl adenosine phosphoramidate pronucleotide (pro-N6pA) and treated different cell lines, such as HeLa, SH-SY5Y, etc., to identify the sites. The AMPylated proteins found are engaged in a variety of metabolic pathways, including a widely conserved key regulator of glycolysis, ATP-dependent 6-phosphofructokinase (PFKP), proteolysis (CTSA, CTSB), regulation of PTMs (PPME1), and UPR (HSPA5 and SQSTM1). They identified a total of 162 protein sequences involved in this distinct modification. We investigated these proteins through the UniProt database and identified a total of 133 unique protein sequences, which are used to build our dataset. We then used CD-Hit to remove proteins with over 40% sequence similarity in order to eliminate redundancy in the dataset 31. The resulting dataset contains 130 unique proteins with less than 40% sequence similarity. After that, for each AMPylation and non-AMPylation site, a 31-residue peptide was extracted, containing the central AMPylation or non-AMPylation site with 15 residues upstream and 15 residues downstream. We tried different peptide lengths, among which 31-residue peptides attained the best results. For sites near either end of a protein, with fewer than 15 neighboring amino acids on one side, we equalize the peptide length by padding with the "X" residue. As a result, a total of 153 peptides with AMPylated sites and 28872 peptides with non-AMPylated sites were extracted from the 130 protein sequences. From the 28872 non-AMPylated sites, we randomly selected 250 sequences to balance our dataset, giving a negative-to-positive ratio below 2:1 (250:153, roughly 1.6:1). Thus our final dataset consists of 403 peptide sequences, with 153 AMPylated peptides and 250 non-AMPylated peptides. This dataset is available at https://github.com/MehediAzim/DeepAmp.

Feature encoding. Feature encoding is an important step in building an effective machine learning model.
Binary profile features (also known as one-hot encoding) are straightforward, yet have been shown to be very effective for predicting different functionalities in multi-omics datasets 32,33. In this study, we generate a binary profile for each peptide by representing each amino acid as a 20-dimensional one-hot vector. For instance, Alanine is replaced by the one-hot vector [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. As a result, a sequence of length L is represented by a matrix of dimensions L × 20. With L = 31 (the peptide length), we extract 620 features for each peptide (31 × 20). This feature encoding process is depicted in Fig. 2. Since we use a convolutional neural network to build our model, the binary profile can potentially provide extensive information for training.
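The window extraction and binary profile encoding described above can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: the alphabetical ordering of the 20 amino acids (chosen so that alanine maps to the first position, as in the example vector), the all-zero row used for the "X" padding residue, the 0-based site index, and the toy protein sequence.

```python
# Assumed column order: one-letter codes in alphabetical order (alanine first).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
FLANK = 15  # residues kept on each side of the central site


def extract_window(sequence, site, flank=FLANK, pad="X"):
    """Return the (2 * flank + 1)-residue peptide centred on 0-based `site`."""
    left = sequence[max(0, site - flank):site]
    right = sequence[site + 1:site + 1 + flank]
    left = pad * (flank - len(left)) + left      # pad near the N-terminus
    right = right + pad * (flank - len(right))   # pad near the C-terminus
    return left + sequence[site] + right


def one_hot(peptide):
    """Encode a peptide as a list of 20-dimensional binary rows."""
    rows = []
    for aa in peptide:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:  # assumption: the padding residue "X" stays all-zero
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


toy_protein = "MTSYAAKLT"  # hypothetical sequence, not taken from the dataset
window = extract_window(toy_protein, site=1)  # centred on the threonine
matrix = one_hot(window)                      # 31 rows x 20 columns
flat = [value for row in matrix for value in row]  # 620 features
```

The 31 × 20 matrix is the natural input for one-dimensional convolutions over the sequence axis, while the flattened 620-value vector suits classifiers that expect a fixed-length feature vector.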

Classification technique. Convolutional neural networks (CNNs) are widely used in computational biology for predicting different biological and chemical functionalities and entities from multi-omics datasets. They have shown tremendous success in the prediction of different PTMs, cancer cell type classification, prediction of origins of replication, and many more tasks [34][35][36]. Like any other neural network, a CNN consists of an input layer, hidden layers, and an output layer. What distinguishes the CNN architecture from regular neural networks is the extraction of feature maps using the convolution operation. Unlike the hidden layers of a regular neural network, which are built from sets of fully connected neurons, the hidden layers of a CNN mainly consist of convolutional layers, pooling layers, and fully connected layers 37.
The CNN architecture we used is depicted in Fig. 3. Our CNN classifier consists of three Conv1D layers with [number of filters, kernel size] of [24,7], [16,5], and [8,3], respectively. We also use two MaxPooling1D layers as well as two Dense layers. The input is the L × 20 matrix, where L is the length of the peptide (31). We applied one-dimensional kernels to the input vectors. The output of the first 1-D convolutional layer, which can also be thought of as a motif scanner, is passed to a max-pooling layer. Of the three convolutional layers, max-pooling is applied after the first two. The output of the last convolutional layer is passed directly to a fully connected layer and then to the prediction layer. The Rectified Linear Unit (ReLU) was used as the activation function for each intermediate layer, as it is popular for its simplicity and effectiveness 38,39. In each of the convolutional layers and the fully connected layer, we used dropout to avoid overfitting 40. Even though deeper CNN models provide the best results for computer vision problems 41, different studies have shown that, for biological sequence data presented as an input matrix, increasing the depth of the convolutional network does not necessarily improve prediction accuracy, especially for smaller datasets similar to ours 42. Furthermore, a shallower network reduces the chance of overfitting and requires fewer instances for training 40,43. To prevent overfitting, we therefore developed a shallow CNN architecture. With only 7825 trainable parameters, the model provides a balanced result. Additionally, to prevent overfitting, we use two regularization methods, namely dropout and L2, for each Conv1D layer.
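As a rough check on the size of this architecture, the sketch below counts the trainable parameters of the three convolutional layers. It assumes standard one-dimensional convolutions with a bias term, so each layer has (kernel size × input channels + 1) × filters parameters, and it assumes that no other trainable components (such as normalization layers) are present. The helper name `conv1d_params` is hypothetical.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter for a standard 1-D convolution."""
    return (kernel_size * in_channels + 1) * filters


# (filters, kernel size) for the three Conv1D layers described in the text.
conv_layers = [(24, 7), (16, 5), (8, 3)]

in_channels = 20  # one channel per amino acid in the binary profile
per_layer = []
for filters, kernel_size in conv_layers:
    per_layer.append(conv1d_params(in_channels, filters, kernel_size))
    in_channels = filters  # the next layer sees this layer's feature maps

conv_total = sum(per_layer)    # parameters in the convolutional stack
remaining = 7825 - conv_total  # left for the two Dense layers
```

Under these assumptions the convolutional stack accounts for 5712 of the 7825 parameters, leaving 2113 for the two Dense layers. The text does not state the Dense layer sizes; as an observation only, 2113 is what a 64-value flattened input feeding a 32-unit layer and a single-unit output would require (64 × 32 + 32 + 32 + 1).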
Evaluation methods. To measure the efficacy of DeepAmp, we use k-fold cross-validation. In k-fold cross-validation, the dataset is split into k subsets. Of these k subsets, k-1 are used for training and the remaining one is used for validation. Repeating this for each subset means that every sample is used for both training and validation. Because a larger k leaves more samples for training in each iteration, classifiers tend to show better results. We used stratified k-fold cross-validation, which maintains a fixed ratio of negative and positive sites in the training and validation sets 44. In this study, we evaluate our model using k = 5 and k = 10, two common values for this parameter.
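The stratified splitting described above can be sketched in a few lines. This is an illustrative implementation under simple assumptions (indices of each class are shuffled and dealt round-robin into folds), not the routine used in the study; the function names and the random seed are hypothetical.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with near-identical class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal this class's indices round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_validation_splits(folds):
    """Yield (train, validation) index lists, holding out one fold at a time."""
    for held_out, validation in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, validation


# Labels matching the dataset composition: 153 positives and 250 negatives.
labels = [1] * 153 + [0] * 250
folds = stratified_folds(labels, k=10)
splits = list(train_validation_splits(folds))
```

With k = 10, each validation fold holds 25 negatives and 15 or 16 positives, so the class ratio in every fold stays close to that of the full dataset.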

Conclusion
In this study, we presented a new dataset that can be used to evaluate computational methods, especially machine learning based models, for predicting AMPylation. In addition, we proposed a new deep learning-based tool called DeepAmp for predicting AMPylation using a CNN and binary profile feature vectors. DeepAmp achieves an accuracy of 77.7% and sensitivity, specificity, MCC, and AUC scores of 79.1%, 76.8%, 0.55, and 0.85, respectively, for tenfold cross-validation. DeepAmp also significantly outperforms widely used machine learning models, including Support Vector Machine, K-Nearest Neighbor, and Random Forest, in predicting AMPylation sites. Because of the limited sample size available, achieving high prediction accuracy is challenging. In future refinements of our work, we aim to incorporate new AMPylation sites into the dataset and create a larger database for AMPylation. Furthermore, we aim to improve our predictor's performance by using different feature sets and deeper CNN architectures. Our presented dataset and DeepAmp as a standalone predictor are publicly available at https://github.com/MehediAzim/DeepAmp.